Abstract

Background: For genetic association studies in designs of unrelated individuals, current statistical methodology 
typically models the phenotype of interest as a function of the genotype and assumes a known statistical model for 
the phenotype. In the analysis of complex phenotypes, especially in the presence of ascertainment conditions, the 
specification of such model assumptions is not straight-forward and is error-prone, potentially causing misleading 
results. 

Results: In this paper, we propose an alternative approach that treats the genotype as the random variable and 
conditions upon the phenotype. Thereby, the validity of the approach does not depend on the correctness of 
assumptions about the phenotypic model. Misspecification of the phenotypic model may lead to reduced statistical 
power. Theoretical derivations and simulation studies demonstrate both the validity and the advantages of the 
approach over existing methodology. In the COPDGene study (a GWAS for Chronic Obstructive Pulmonary Disease 
(COPD)), we apply the approach to a secondary, quantitative phenotype, the Fagerstrom nicotine dependence score, 
that is correlated with COPD affection status. The software package that implements this method is available. 

Conclusions: The flexibility of this approach enables the straight-forward application to quantitative phenotypes and 
binary traits in ascertained and unascertained samples. In addition to its robustness features, our method provides the 
platform for the construction of complex statistical models for longitudinal data, multivariate data, multi-marker tests, 
rare-variant analysis, and others. 
Background

In genetic association studies, individuals are often recruited based on case-control ascertainment conditions of the primary phenotype [1]. For the analysis of secondary phenotypes, this recruitment scheme can become problematic. If the secondary phenotype is correlated with the primary phenotype in a case-control study, the distribution of the secondary phenotype can be fundamentally different from that in the general population. For example, in a genetic association study of COPD in which all cases have COPD and control subjects have normal pulmonary function, the distribution of quantitative lung phenotypes can deviate substantially from their distribution in the general population. For samples that are ascertained in this fashion, standard statistical methods may lead to misleading results or may lack statistical power to identify true genotype-phenotype associations. There are several methods to accurately estimate the odds ratio of genetic variants for binary secondary phenotypes associated with case-control status [2-10], but these methods cannot easily accommodate continuous secondary phenotypes. For the special case that the secondary phenotype is normally distributed or binary, Lin & Zeng (2009) proposed an adjusted score test that incorporates genetic associations with affection status into the test statistic [11].

*Correspondence: sharon.lutz@ucdenver.edu
1 Department of Biostatistics, University of Colorado Anschutz Medical Campus, Aurora, USA
3 Department of Biostatistics, Harvard School of Public Health, Boston, USA
Full list of author information is available at the end of the article
We present a more general approach that does not require any distributional assumptions for the secondary phenotype. We refer to the approach as the non-parametric population-based association test (NPBAT). The approach has a form similar to the Family Based Association Test (FBAT), a non-parametric test statistic that is frequently used in the family-based setting [12-15]. The flexibility of our approach allows us to construct a genetic association test for standard and complex phenotypes that is non-parametric with respect to the phenotype. The class of tests is very general. It includes most standard association tests and can be applied to multivariate traits and phenotypes, multiple genetic markers, and case-control studies where phenotypic information is available for the cases but correlated with the case-control status [16-18].

The general concept of the proposed association-testing 
framework is to condition on the phenotype of interest 
and treat only the genetic data as random [12,13,15]. By 
assuming that the phenotype data is deterministic, the 
validity of the approach does not depend on the cor- 
rectness of the phenotypic assumptions. Nevertheless, the 
power of the approach can be increased by incorporat- 
ing a plausible model for the phenotype into the test 
statistic. Based on theoretical considerations and on sim- 
ulation studies, we show that the new approach is robust 
against misspecification of phenotype assumptions. At the 
same time, this approach achieves the same power level 
as standard genetic association tests for population-based 
designs when the phenotype of interest has a normal 
distribution or is dichotomous. For studies where a quan- 
titative trait is correlated with case-control status, our 
simulation studies examine the power and significance 
levels for the proposed approach, which does not require 
any adjustment for the ascertainment conditions. 

We illustrate the practical advantages of NPBAT by an application to the COPDGene study. The COPDGene study is a case-control study of the genetics of COPD in current or former smokers with at least 10 pack-years of smoking history [19]. We test the genetic association between single nucleotide polymorphisms (SNPs) in the CHRNA3/5 region and the Fagerstrom Nicotine Dependence score (FNDS). FNDS is a validated instrument of nicotine dependence in current smokers and was measured in the current smokers, but not the former smokers, in the COPDGene study. NPBAT, which uses the genotype data in both current and former smokers, is compared to the published genetic association analysis of SNPs in the CHRNA3/5 region and FNDS that was performed in current smokers only [20].

Methods 

In a genetic association study, n unrelated study subjects have been recruited based on a predefined ascertainment condition. Let $X_i$ denote the genotype of individual $i$. The specific value of $X_i$ will depend upon the genetic model under consideration. For instance, for an additive model, $X_i = 0, 1, 2$ for 0, 1, 2 disease alleles, respectively. $X_i$ may also be a vector in order to test several alleles simultaneously. Let $T_i$ denote the numerical trait information for individual $i$. For example, $T_i$ could equal one for affected individuals and zero for unaffected individuals. Different coding functions are applied depending on the phenotype of interest. For binary and continuous traits, we will discuss efficient coding schemes below. First, we define a general class of test statistics as

$$S = \sum_{i=1}^{n} (X_i - E_X)\, T_i \qquad (1)$$

Note that E(S) = 0 under the null hypothesis of no association between the genotype X and the phenotype Y. Constructing a conditional score test in which the genotype $X_i$ is the dependent variable and we condition upon the numerical trait information $T_i$, the NPBAT statistic has the following form:

$$\mathrm{Stat}_{NPBAT} = \frac{S - E[S]}{\sqrt{\mathrm{var}(S)}} = \frac{\sum_{i=1}^{n} (X_i - E_X)\, T_i}{\sqrt{\mathrm{var}(S)}} \qquad (2)$$

where $E_X$ denotes the expectation of the marker score/genotype X under the null hypothesis of no genetic association between the phenotype and the marker locus. $E_X$ can be estimated by the sample mean of the genotypes. The asymptotic distribution of the NPBAT statistic under the null hypothesis depends on the estimation of $E_X$ and on the specification of the trait information $T_i$, and is derived in the Appendix.
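As an illustration of equations (1) and (2), the following minimal Python sketch computes an NPBAT-style statistic for a single marker. It is not the authors' C++ implementation. The function name `npbat_statistic` is hypothetical, and the sketch assumes that var(S) can be approximated by the sample variance of the genotypes multiplied by the sum of squared coded traits, with the traits treated as fixed.

```python
import math


def npbat_statistic(genotypes, traits, e_x=None):
    """Return sum((X_i - E_X) * T_i) / sqrt(var(S)) for one marker.

    Assumption (not taken from the paper's software): var(S) is approximated
    by var_hat(X) * sum(T_i^2), i.e. genotypes are i.i.d. and traits are fixed.
    """
    n = len(genotypes)
    if n != len(traits) or n < 2:
        raise ValueError("need two equally long sequences with n >= 2")
    x_bar = sum(genotypes) / n
    if e_x is None:
        # Default: estimate E_X by the sample mean of the genotypes.
        e_x = x_bar
    s = sum((x - e_x) * t for x, t in zip(genotypes, traits))
    var_x = sum((x - x_bar) ** 2 for x in genotypes) / (n - 1)
    var_s = var_x * sum(t * t for t in traits)
    return s / math.sqrt(var_s)


if __name__ == "__main__":
    # Additive genotype coding and a trait already centred at an offset.
    x = [0, 0, 1, 1, 2, 2]
    t = [-1.0, -1.0, 0.0, 0.0, 1.0, 1.0]
    print(npbat_statistic(x, t))
```

The optional argument `e_x` allows the genotype expectation to be supplied from another group of subjects (for example, the controls), as discussed later for secondary phenotypes; the variance approximation used here does not account for that extra estimation step.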

There are various ways to code the phenotype of interest and define the coding function $T_i$. For the analysis of affection status, one could specify the coding function to be $T_i = 1$ or $T_i = 0$, depending on the disease status of the proband. However, as we show in Appendix A, a more efficient way is to set $T_i = 1 - \frac{\#\text{cases}}{N}$ for the cases, and $T_i = 0 - \frac{\#\text{cases}}{N}$ for the controls, where N is the total number of cases and controls. Then the NPBAT statistic is approximately the same as the Cochran-Armitage trend test.

If the phenotype $Y_i$ is in fact normally distributed and $T_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i$ denotes the fitted values from regressing the phenotype Y on any covariates, then the NPBAT statistic is approximately the same as a t-statistic from a linear regression. In general, if the phenotype $Y_i$ is continuous, we recommend $T_i = Y_i - \mu_y$, where $\mu_y$ is the phenotypic mean in the general population.
While it is appealing that the NPBAT statistic is comparable to standard methods in these simple scenarios, its real appeal arises when phenotype information is available for only some subjects but genetic information is available for all subjects. For example, in case-control studies, an additional quantitative phenotype may be available for the cases but not the controls. When testing for a genetic association with this additional quantitative phenotype, the NPBAT statistic uses the genotypes of both the cases and the controls with the optimally coded phenotype $T_i = Y_i - Y_{offset}$, where $Y_{offset}$ is a constant. The choice of this constant is described in detail in the Simulations sub-section, and the asymptotic distribution of the NPBAT statistic is derived in the Appendix. Using this optimal offset choice, the NPBAT statistic has a substantial increase in power over other methods, such as the NPBAT statistic with the offset choice $T_i = Y_i - \bar{Y}$ or the improved score test, which is uniformly more powerful than score tests based on the generalized linear model such as the Cochran-Armitage trend test, the allelic $\chi^2$ test and the genotypic $\chi^2$ test [21].

Adjustments for population admixture 

The NPBAT statistic can be adjusted for population admixture by using standard methods such as principal components analysis or genomic control [22,23]. For example, to account for population admixture, one can treat the principal components as additional covariates representing population information and incorporate them into the test statistic in equation (2) by taking $T_i = Y_i - \hat{Y}_i$, where $\hat{Y}_i$ denotes the fitted values from regressing the phenotype Y on the top principal components that explain the greatest amount of variability in the data. Note that the above approach requires that the phenotype Y is dichotomous or roughly normally distributed.

Extension to multiple phenotypes 

The NPBAT statistic can be extended to m phenotypes to test the null hypothesis that a marker locus is not linked to any disease-susceptibility locus for any of m selected phenotypes. Then the test statistic becomes

$$S = \sum_{i=1}^{n} (X_i - E_X)\, T_i \qquad (3)$$

Note that E(S) = 0, as is the case for the univariate version above. But here $T_i$ is the m x 1 vector for the m phenotypes and $X_i$ is just one marker, so S is m x 1. The m x m variance matrix is the following

$$V_S = \sigma_X^2 \sum_{i=1}^{n} T_i T_i^t \qquad (4)$$



where $\sigma_X^2$ is the variance of the marker X, estimated from the sample. Then the NPBAT statistic is the following

$$\chi_{NPBAT} = S^t V_S^{-1} S \qquad (5)$$

Due to the estimation of $E_X$ based on the sample, this statistic does not have a chi-square distribution and a permutation test needs to be used to assess significance levels, which can be done by using the NPBAT software package (https://sites.google.com/site/genenpbat/).

Simulations 

In genetic association case-control studies, only the cases may have additional phenotypic information available. For instance, in a case-control study where the cases have asthma (the primary phenotype), only the cases may have FEV measurements (the secondary phenotype). In this scenario, the secondary phenotype FEV will be more severe than it would be in the general population, and the analysis of this secondary phenotype can be misleading due to the ascertainment of subjects based on the primary phenotype, asthma. To simulate this scenario, we generated the genotype X for 500 cases and 500 controls and a secondary phenotype Y for only the 500 cases from a truncated normal distribution with standard deviation $\sigma = 1$, mean $aX$ under the alternative and mean 0 under the null, and a cutoff such that the secondary phenotype is in the top 50 percent of the normal distribution. We consider an allele frequency of p = 20%, and a is chosen such that the heritability h [24] equals 1%, 2%, 3%, or 5%. Solving for a gives

$$a = \sigma \sqrt{\frac{h}{2p(1-p)(1-h)}}$$
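The relation between the heritability h and the additive effect a can be checked numerically. The short Python sketch below assumes an additive SNP in Hardy-Weinberg equilibrium, so that the genotype variance is 2p(1-p), and it ignores the truncation step of the simulation. The function names `effect_size` and `heritability` are hypothetical.

```python
import math


def effect_size(h, p, sigma=1.0):
    """Additive effect a such that the SNP explains a fraction h of the variance."""
    return sigma * math.sqrt(h / (2.0 * p * (1.0 - p) * (1.0 - h)))


def heritability(a, p, sigma=1.0):
    """Fraction of trait variance explained by an additive SNP (HWE assumed)."""
    genetic = a * a * 2.0 * p * (1.0 - p)
    return genetic / (genetic + sigma * sigma)


if __name__ == "__main__":
    # Heritability values and allele frequency used in the simulations.
    for h in (0.01, 0.02, 0.03, 0.05):
        a = effect_size(h, p=0.2)
        print(f"h = {h:.2f} -> a = {a:.4f}, round trip h = {heritability(a, 0.2):.4f}")
```

The round trip through `heritability` confirms that the closed form for a inverts the variance decomposition h = 2p(1-p)a^2 / (2p(1-p)a^2 + sigma^2).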

We compute the NPBAT statistic with the coded phenotype $T_i = Y_i - Y_{offset}$, where $Y_{offset}$ is a constant that ranges from -5 to 15 and $E_X$ is the sample mean of the genotypes in the cases. We also compute the NPBAT statistic with $E_X$ equal to the sample mean of the genotypes in the controls and with $E_X$ equal to the sample mean of the genotypes in the cases and the controls combined. We compare the power of these three NPBAT statistics to the Improved Score Test, which is uniformly more powerful than score tests based on the generalized linear model such as the Cochran-Armitage trend test, the allelic $\chi^2$ test and the genotypic $\chi^2$ test [21]. We also compare the power of the NPBAT approach to a standard linear regression.

Under the null hypothesis, the NPBAT method maintains a significance level of approximately 5% or less, as seen in Figure 1, whether $E_X$ is the sample mean of the cases, the controls, or both. Figure 1 also depicts the power results of these simulations. Note that the spike or drop in all the plots occurs where $Y_{offset} \approx \bar{Y}$, the sample mean of the secondary phenotype for the cases, since the secondary phenotype is not available for the controls in this scenario. The power of the NPBAT approach is maximized when $E_X$ is based on the genotype of the controls and $Y_{offset}$ is substantially different from the phenotypic
[Figure 1 panels: Type-1 Error Rate for h = 0.01; Power for h = 0.01]

Figure 1 Power and Significance levels for NPBAT, the Improved Score Test and the Likelihood Ratio Test (LRT). This plot compares the power and type-1 error rate of the NPBAT method using $E_X$ based on the sample mean of the cases, the controls, and both the cases and controls. The power and significance levels of this method are compared to the improved score test and a standard linear regression. Note that the spike or drop in all the plots occurs where $Y_{offset} \approx \bar{Y}$, the sample mean of the secondary phenotype for the cases, since the secondary phenotype is not available for the controls in this scenario. The power of the NPBAT approach is maximized when $E_X$ is based on the genotype of the controls and $Y_{offset}$ is substantially different from the phenotypic mean of the cases. When $E_X$ is based on the genotype of the cases, the power of the NPBAT approach is similar to that of the improved score test and the regression. Note that the power of the NPBAT approach when $E_X$ is based on the genotype of both the cases and the controls is best for high values of heritability.
mean of the cases. When $E_X$ is based on the genotype of the cases, the power of the NPBAT approach is similar to that of the improved score test and the regression. Note that the power of the NPBAT approach when $E_X$ is based on the genotype of both the cases and the controls is best for high values of heritability.

These simulations show that, when analyzing secondary phenotypes correlated with case-control status in case-control studies, $Y_{offset}$ should be set to a constant substantially different from the phenotypic mean of the sample and $E_X$ to the genotypic mean of the controls. In this situation, a robust and efficient choice for the offset $Y_{offset}$ is the phenotypic mean in the general population. Note that the results of these simulations are analogous to those for the FBAT statistic in family studies, where it was found that when ascertaining cases only from a quantitative distribution, one needed to choose an offset that was outside the range of the cases' phenotypic values [15].

Data analysis 

We applied the NPBAT method to the Genetic Epidemiology of COPD (COPDGene) Study, which is a multi-center case/control study designed to identify genetic factors associated with COPD and to characterize COPD-related phenotypes [19]. The study recruited COPD cases and smoking controls who were non-Hispanic whites and African Americans ages 45 to 80 with at least 10 pack-years of smoking history. The study also collected the Fagerstrom Test for Nicotine Dependence (FTND) to assess nicotine dependence, but the FTND score was only available for cases and controls who were current smokers at study enrollment. This data analysis represents the scenario where the secondary phenotype (FTND score) is available only in current smokers but the genotypic information is available for both current and former smokers. In the first 1,000 Non-Hispanic White (NHW) individuals, the FTND score, controlling for age and gender, was tested for an association with SNPs in the CHRNA3/5 region for COPD cases and controls who are current smokers, and an association was found for rs1051730 or rs8034191 [20]. We applied the NPBAT statistic to the first 1,000 NHW individuals using the genotype of both current (307 individuals) and former smokers (669 individuals), controlling for age and gender, and obtained the results shown in Table 1 for these 2 SNPs. Note that the NPBAT statistic yielded smaller p-values than both the Improved Score Test and the regression controlling for age and gender.

Results and discussion 

NPBAT is a new statistical framework for population 
based genetic association tests that does not require 
making specific assumptions about the distribution of 
the phenotype. By conditioning on the phenotype, 
NPBAT is robust against violations of phenotypic model 
assumptions. The practical implications of NPBAT are 
demonstrated when applied to the COPDGene Study. 
FNDS, a measure of nicotine dependence, was assessed in 
current smokers that represent 31% of study participants 
in COPDGene. We analyzed SNPs shown to be associ- 
ated with FNDS [20]. NPBAT identified the same SNPs as 
conventional methods but with slightly greater statistical 
significance than a linear regression for FNDS control- 
ling for age and gender or the improved score test. Other 
examples of applications of NPBAT are 

1. when a sample is ascertained based on case/control 
status and the phenotype of interest is correlated 
with case status 

2. in a cohort study in which prevalent cases are 
excluded (i.e. the classic epidemiologic cohort study) 
and the phenotype of interest is correlated with the 
disease of interest 

3. a pharmacogenetics study using a randomized 
clinical trial when participants are ascertained based 
on the levels of the target of therapy 

The broad application of NPBAT is to scenarios where 
samples are ascertained based on selection criteria that 
are correlated with the phenotype of interest. 

Conclusions 

In conclusion, the key advantage of the proposed approach is its robustness against misspecification of the phenotypic model. This enables extensions to different types of traits and the integration of complex statistical models for the phenotype, while the validity of the approach is not compromised by such generalization. Though the power is sensitive to the offset choice, NPBAT is valid regardless of the offset. As with all population-based association



Table 1 P-values for the association between the Fagerstrom Test for Nicotine Dependence (FTND) and the markers listed below for the different statistical tests: NPBAT where $E_X = \bar{x}_c$ is the genotypic mean of the current smokers, NPBAT where $E_X = \bar{x}_f$ is the genotypic mean of the former smokers, the Improved Score Test and a linear regression

Method       NPBAT: $E_X = \bar{x}_c$   NPBAT: $E_X = \bar{x}_f$   Improved Score Test   Regression
rs1051730    0.00134                    0.00138                    0.00227               0.00259
rs8034191    0.00386                    0.00391                    0.00694               0.00744

tests, population stratification can be a problem. Adjusting for known population sub-structure using principal components of ancestry informative markers (AIMs) or using genomic control can reduce the impact of population stratification. The NPBAT software package, which implements this method, is detailed in the Appendix.

Appendix 

Appendix A: Offset choice when Y is binary 

The following considers the offset choice for the coded trait T when Y is binary. Assume the phenotype of interest is binary and the genotype of interest follows an additive model. Let $r_0$, $r_1$, and $r_2$ denote the number of cases with 0, 1, and 2 disease alleles, respectively. Let R denote the total number of cases. Let S denote the total number of controls. Let $n_0$, $n_1$, and $n_2$ denote the number of cases and controls with 0, 1, and 2 disease alleles, respectively. Let N = S + R denote the total number of cases and controls. In this scenario, the standard statistical method used is the Cochran-Armitage trend test, which can be written as follows:

$$\mathrm{Stat}_{Cochran} = \frac{N(r_1 + 2r_2) - R(n_1 + 2n_2)}{\sqrt{\frac{RS}{N}\left(N(n_1 + 4n_2) - (n_1 + 2n_2)^2\right)}} \qquad (6)$$

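In this scenario, let the coded phenotype be $T_i = Y_i - \mu_y$, where $\mu_y$ is the offset. The NPBAT statistic has the following form:

$$\mathrm{Stat}_{NPBAT} = \frac{N(r_1 + 2r_2) - R(n_1 + 2n_2)}{\sqrt{\left(\frac{\mu_y^2}{R} + \frac{(1-\mu_y)^2}{S}\right) SR\left(N(n_1 + 4n_2) - (n_1 + 2n_2)^2\right)\frac{N}{N-1}}} \qquad (7)$$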

Note that the numerators of both statistics are the same. The ratio of the test statistics can be written as follows:

$$\frac{\mathrm{Stat}_{Cochran}}{\mathrm{Stat}_{NPBAT}} = \sqrt{\frac{N}{N-1}}\,\sqrt{\left(1 + \frac{1}{\gamma}\right)\mu_y^2 + (1 + \gamma)(1 - \mu_y)^2} \qquad (8)$$

where $\gamma = \frac{\#\text{cases}}{\#\text{controls}}$. Given this ratio, the power of the NPBAT statistic relative to the Cochran-Armitage trend test is maximized for the offset choice $\mu_y^{optimal} = \frac{\#\text{cases}}{N}$. For example, if the ratio of the cases versus the controls is 1, the offset choice $\mu_y$ is $\frac{1}{2}$. This corresponds to equally weighting the cases and controls in the conditional test statistic. For large sample size N, such that $\sqrt{\frac{N}{N-1}} \approx 1$, the ratio of the test statistics is approximately one when the offset is set to $\mu_y^{optimal} = \frac{\#\text{cases}}{N}$. Consequently, for the optimal offset choice, the test statistics are approximately the same.
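The relationship between the two statistics in this appendix can be verified numerically from genotype counts. The Python sketch below is an illustration under the notation of Appendix A, not part of the NPBAT software. The function name `trend_and_npbat` and the example counts are hypothetical, and the NPBAT variance assumes that $E_X$ is the pooled sample mean and that the genotype variance is estimated with the N - 1 denominator.

```python
import math


def trend_and_npbat(r, s, mu):
    """Return (Cochran-Armitage statistic, NPBAT statistic) from genotype counts.

    r, s: counts of subjects with 0, 1, 2 disease alleles among cases and controls.
    mu:   offset used in the coded phenotype T_i = Y_i - mu.
    """
    big_r, big_s = sum(r), sum(s)
    n = big_r + big_s
    n1, n2 = r[1] + s[1], r[2] + s[2]
    # Shared numerator of both statistics.
    num = n * (r[1] + 2 * r[2]) - big_r * (n1 + 2 * n2)
    d = n * (n1 + 4 * n2) - (n1 + 2 * n2) ** 2
    cochran = num / math.sqrt(big_r * big_s / n * d)
    weight = mu ** 2 / big_r + (1 - mu) ** 2 / big_s
    npbat = num / math.sqrt(weight * big_s * big_r * d * n / (n - 1))
    return cochran, npbat


if __name__ == "__main__":
    cases, controls = (60, 30, 10), (70, 25, 5)  # made-up counts
    n_total = sum(cases) + sum(controls)
    for mu in (0.1, sum(cases) / n_total, 0.9):
        c, z = trend_and_npbat(cases, controls, mu)
        print(f"offset = {mu:.2f}: Cochran = {c:.4f}, NPBAT = {z:.4f}")
```

With the offset set to the proportion of cases, the two statistics differ only by the factor sqrt(N / (N - 1)); other offsets shrink the NPBAT statistic toward zero.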



Appendix B: asymptotic distribution when the secondary 
phenotype is available for both the cases and controls 

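To derive the asymptotic distribution of the NPBAT statistics for various phenotypic offset choices, let $\sigma_x^2$ denote the variance of X and $\sigma_y^2$ denote the variance of Y. Let $\|a\|$ denote the Euclidean norm. Let $T_{offset} = ((Y_1 - Y_{offset}), \ldots, (Y_n - Y_{offset}))^t$ and let $T_\mu = (T_{\mu,1}, \ldots, T_{\mu,n})^t = ((Y_1 - \bar{Y}), \ldots, (Y_n - \bar{Y}))^t$, where $T_{\mu,i} = (Y_i - \bar{Y})$. Let $X^* = (X_1 - \bar{X}, \ldots, X_n - \bar{X})^t$. Define

$$Z_i = \frac{(X_i - \bar{X})\, T_{\mu,i}}{\sigma_x \|T_\mu\|}. \quad \text{Then} \quad \sum_{i=1}^{n} Z_i = \frac{X^{*t}\, T_\mu}{\sigma_x \|T_\mu\|}.$$

By treating X as random given that Y is fixed, it can be shown that the $Z_i$'s are independent, $E(Z_i) = 0$ and $\mathrm{Var}\left(\sum Z_i\right) = 1$. The Lindeberg condition [25] for $Z_i$, which ensures asymptotic normality of $\sum Z_i$, is then given by

$$\forall \epsilon > 0: \quad \lim_{n \to \infty} \sum_{i=1}^{n} \int_{|Z_i| > \epsilon} Z_i^2 \, dP = 0 \qquad (9)$$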



Since $Z_i$ has a discrete distribution, the Lindeberg condition can only be fulfilled when the integration set $\{|Z_i| > \epsilon\}$ is empty for $n \to \infty$. Since X is the coded genotype and Y is a biological quantity, assume $\sigma_x \neq 0$, $\sigma_y \neq 0$ and both are finite. Then, there exists some constant K such that $\frac{|X_i - \bar{X}|\,|T_{\mu,i}|}{\sigma_x \sigma_y} < K$. Hence we rewrite the Lindeberg condition by

$$\forall \epsilon > 0: \quad \epsilon < |Z_i| = \frac{|X_i - \bar{X}|\,|T_{\mu,i}|}{\sigma_x \|T_\mu\|} < \frac{K}{\sqrt{n}} \to 0 \ \text{as} \ n \to \infty \qquad (10)$$

Hence the integral in the Lindeberg condition is always computed over a set that is empty for $n \to \infty$. Thus the Lindeberg condition is always fulfilled when the regularity condition holds. Then the Lindeberg theorem [26] implies convergence to normality. Then

(11)

Note that the statistic is maximized and has a standard normal distribution when $Y_{offset} = E[Y]$.

Appendix C: asymptotic distribution when the secondary 
phenotype is only available for the cases 

Here, we derive the asymptotic distribution of the NPBAT statistic for secondary phenotypes in case/control studies. Consider a case-control study where genetic information is available for both the cases and the controls, but the phenotypic information is only available for the cases. Here n is only the number of cases and all summations are only over the cases, since the phenotypic information is not available for the controls, whereas in Appendix B, n is the number of cases and controls and the summation is over both the cases and the controls.

Lutz et al. BMC Genetics 2013, 14:13
http://www.biomedcentral.com/1471-2156/14/13

Let $\bar{X}_{cases}$ denote the sample mean of the genotypes of the cases and $\sigma_x^2$ be the true variance of the genotypes. Let $E_X = \bar{X}_{controls}$ be the sample mean of the genotypes of the controls. Under the null hypothesis and assuming no population stratification, the sample mean of the genotypes of the cases and the sample mean of the genotypes of the controls both converge to E[X] since X is not associated with Y. Let $X_{text} = (X_1 - \bar{X}_{text}, \ldots, X_n - \bar{X}_{text})^t$, where text = cases or controls, meaning $X_1, \ldots, X_n$ are the coded genotypes of the cases but $\bar{X}$ can be computed based on the cases, the controls, or both. Define



(Xi — ^control) (Xi — ^offset) 

^v / lir /i || 2 + 2(r-r offset )2 



(12) 



then 



i=l 



yt hp 
yv control x 



^ynr /x |p + 2(r- 

V n(X C3LSe 



yt t 

^case 1 M 



Offset) 2 
" -^control) (Y — ^offset) 



+ 2(Y-Y oSset ) 2 



(13) 



It is important to note that the $Z_i$'s are independent, $E(Z_i) = 0$ and $\mathrm{Var}\left(\sum Z_i\right) = 1$, which is obtained by first taking the conditional expectation treating X as random and Y as fixed. The Lindeberg condition [25] for $Z_i$, which ensures asymptotic normality of $\sum Z_i$, is then given by

$$\forall \epsilon > 0: \quad \lim_{n \to \infty} \sum_{i=1}^{n} \int_{|Z_i| > \epsilon} Z_i^2 \, dP = 0 \qquad (14)$$



Since $Z_i$ has a discrete distribution, the Lindeberg condition can only be fulfilled when the integration set $\{|Z_i| > \epsilon\}$ is empty for $n \to \infty$. Since X is the coded genotype and Y is a biological quantity, assume $\sigma_x \neq 0$, $\sigma_y \neq 0$ and both are finite. Then, there exists some constant K such that $\frac{|X_i - \bar{X}_{control}|\,|T_i|}{\sigma_x \sigma_y} < K$. Hence we rewrite the Lindeberg condition by

$$\forall \epsilon > 0: \quad \epsilon < |Z_i| = \frac{|X_i - \bar{X}_{control}|\,|T_i|}{\sigma_x \sqrt{\|T_\mu\|^2 + 2(\bar{Y} - Y_{offset})^2}} < \frac{|X_i - \bar{X}_{control}|\,|T_i|}{\sigma_x \|T_\mu\|} < \frac{K}{\sqrt{n}} \to 0 \ \text{as} \ n \to \infty \qquad (15)$$



Hence the integral in the Lindeberg condition is always computed over a set that is empty for $n \to \infty$. Thus the Lindeberg condition is always fulfilled when the regularity condition holds. Then the Lindeberg theorem [26] implies convergence to normality. Then

$$\frac{\|T_{offset}\|}{\sqrt{\|T_\mu\|^2 + 2(\bar{Y} - Y_{offset})^2}}\, \mathrm{Stat}_{NPBAT} = \sum_{i=1}^{n} Z_i \to_d N(0, 1) \qquad (16)$$

Then the NPBAT statistic is normally distributed with mean zero and variance given above. Note that the variance is always greater than or equal to one and equals one when $Y_{offset} = E[Y]$. Note that if $Y_{offset} = \bar{Y}$ and $E_X = \bar{X}_{controls}$ then NPBAT has a standard normal distribution. As seen in the Simulations section and Figure 1, when $E_X$ is based on the controls and the phenotype information is only available for the cases, the power is maximized when $Y_{offset} \neq \bar{Y}$ because the variance equals the minimum when $Y_{offset} \approx E[Y]$.

Appendix D: NPBAT software 

A software package implemented in C++ to compute both single-phenotype and multiple-phenotype NPBAT statistics is available for download at the following website: https://sites.google.com/site/genenpbat/. In addition to NPBAT statistics, other population-based statistics such as the Armitage Trend Test and the Fisher Exact Test are also available. Currently, only two platforms are supported: linux64 and windows64. The NPBAT software package reads in genetic data through PLINK-style pedigree (ped), map (map) and phenotype (phe) files. The website provides detailed information on how to use the software package.